The Index Shotgun antipattern is about creating or dropping indexes without reason, so let’s come up with ways to analyze a database and find good reasons to include indexes or omit them.
We can use the mnemonic “MENTOR” to describe a checklist for analyzing our database for good index choices: Measure, Explain, Nominate, Test, Optimize, and Rebuild.
## Measure
We can’t make informed decisions without information. Most databases provide some way to log the time it takes to execute SQL queries so we can identify the operations with the greatest cost. For example:
- Microsoft SQL Server and Oracle both have SQL Trace facilities as well as tools to report and analyze trace results. Microsoft calls this tool the SQL Server Profiler, and Oracle calls it TKProf.
- MySQL and PostgreSQL can log queries that take longer to execute than a specified threshold of time. MySQL calls this the slow query log, and its `long_query_time` configuration parameter defaults to 10 seconds. PostgreSQL has a similar configuration variable, `log_min_duration_statement`. PostgreSQL also has a companion tool called pgFouine, which helps us analyze the query log and identify queries that need attention. (A configuration sketch follows this list.)
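As a minimal sketch, assuming sufficient privileges on each server, here is how we might enable this logging; the 2-second threshold is purely illustrative:

```sql
-- MySQL: enable the slow query log at runtime
SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 2;  -- log statements slower than 2 seconds

-- PostgreSQL: log statements slower than 2 seconds, then reload the config
ALTER SYSTEM SET log_min_duration_statement = '2s';
SELECT pg_reload_conf();
```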
Once we know which queries account for the most time in our application, we know where we should focus our optimizing attention for the greatest benefit. We might even find that all queries are working efficiently except for a single bottleneck query. This is the query we should start optimizing.
**The database isn’t always the bottleneck**
The most time-consuming query isn’t necessarily the area of greatest cost in our application if it runs only rarely. Simpler queries that run much more frequently (often unexpectedly so) may account for more total time, and giving attention to optimizing those queries can be the most beneficial decision.
While measuring query performance, we should disable all query result caching. This type of cache is designed to bypass query execution and index usage, leading to inaccurate measurements.
We can get more accurate information by profiling our application after we deploy it, collecting aggregate data about where the code spends its time when real users are running it against the real database. We should review the profiling data from time to time to make sure we haven’t acquired a new bottleneck.
We must also remember to disable or turn down the reporting rate of profilers after we have finished measuring because these tools incur overhead costs.
## Explain
After identifying the query with the greatest cost, our next step is to find out why it’s so slow. Every database uses an optimizer to pick indexes for our query. We can get the database to give us a report of its analysis, called the query execution plan (QEP).
The syntax to request a QEP varies by database brand. The different syntaxes are given in the table below.
| Database Brand | QEP Reporting Solution |
|---|---|
| IBM DB2 | `EXPLAIN`, the `db2expln` command, or Visual Explain |
| Microsoft SQL Server | `SET SHOWPLAN_XML`, or Display Execution Plan |
| MySQL | `EXPLAIN` |
| Oracle | `EXPLAIN PLAN` |
| PostgreSQL | `EXPLAIN` |
| SQLite | `EXPLAIN` |
There’s no standard for the information that a QEP report contains or the format of the report. In general, the QEP shows us which tables are involved in a query, how the optimizer chooses to use indexes, and in what order it will access the tables. The report may also include statistics, such as the number of rows generated by each stage of the query.
Let’s look at a sample SQL query and request a QEP report. The query below is an illustrative sketch consistent with the discussion that follows: it joins `Bugs` to `Products` through `BugsProducts`, matches a pattern with `LIKE`, and sorts the results (the column names `summary` and `date_reported` and the product name are assumptions):
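```sql
EXPLAIN SELECT Bugs.*
FROM Bugs
  JOIN BugsProducts USING (bug_id)
  JOIN Products USING (product_id)
WHERE Bugs.summary LIKE '%crash%'              -- leading wildcard defeats ordinary indexes
  AND Products.product_name = 'OpenRoundFile'  -- hypothetical product name
ORDER BY Bugs.date_reported DESC;              -- sorted without the help of an index
```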
In the output, the `key` column shows that this query makes use of only the primary key index on `BugsProducts`. The extra notes in the last column indicate that the query will sort the results in a temporary table, without the benefit of an index.
The `LIKE` expression forces a full table scan in `Bugs`, and there is no index on `Products.product_name`. We can improve this query if we create a new index on `product_name` and also use a full-text search solution, as sketched below.
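A sketch of both fixes, assuming MySQL and the illustrative schema above (the index names are made up):

```sql
-- Index the filter column so the optimizer can avoid a full scan of Products:
CREATE INDEX Products_product_name ON Products (product_name);

-- A full-text index on Bugs.summary can replace the LIKE '%crash%' scan
-- (supported for MyISAM, and for InnoDB in MySQL 5.6 and later); the query
-- would then use MATCH(summary) AGAINST('crash') instead of LIKE:
CREATE FULLTEXT INDEX Bugs_summary_fts ON Bugs (summary);
```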
The information in a QEP report is vendor-specific. For this example, we can read the MySQL manual page “Optimizing Queries with EXPLAIN” to understand how to interpret the report.
## Nominate
Now that we have the optimizer’s QEP for our query, we should look for cases where the query accesses a table without using an index.
Some databases have tools that do this for us by collecting query trace statistics and proposing a number of changes, including the creation of new indexes that are missing but that would benefit our query. A few examples are:

- IBM DB2 Design Advisor
- Microsoft SQL Server Database Engine Tuning Advisor
- MySQL Enterprise Query Analyzer
- Oracle Automatic SQL Tuning Advisor
Even without automatic advisors, we can learn how to recognize which index could benefit a query. We need to study our database’s documentation to interpret the QEP report.
## Test
This step is important: after creating indexes, we must profile our queries again. It’s important to confirm that our changes made a difference; otherwise, we can’t know whether our work is done.
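For instance, assuming the illustrative index created earlier, we can re-run the same `EXPLAIN` and check that the plan has changed:

```sql
-- After CREATE INDEX Products_product_name, request the plan again:
EXPLAIN SELECT Bugs.*
FROM Bugs
  JOIN BugsProducts USING (bug_id)
  JOIN Products USING (product_id)
WHERE Products.product_name = 'OpenRoundFile';
-- In MySQL's output, the Products row should now show the new index in
-- the key column instead of NULL, and report far fewer examined rows.
```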
We can also use this step to justify the work we put into this optimization. After all, we don’t want our weekly status report to say, “I’ve tried everything I can think of to fix our performance issues, and we’ll just have to wait and see…” when we have the opportunity to say instead, “I determined that we could create one new index on a high-activity table, and I improved the performance of our critical queries by 38 percent.”
## Optimize
Indexes are compact, frequently used data structures, which makes them good candidates for keeping in cache memory. Reading an index from memory is far faster than reading it from disk.
Database servers allow us to configure the amount of system memory to allocate for caching. Most databases set the cache buffer size quite low to ensure that the database works well on a wide variety of systems. But it’s probably a good idea to raise the size of the cache.
How much memory should we allocate to cache? There’s no single answer to this question because it depends on the size of our database and how much system memory we have available.
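As one hedged example, on MySQL with InnoDB the relevant setting is `innodb_buffer_pool_size` (resizable at runtime since MySQL 5.7); the 4 GB figure below is purely illustrative:

```sql
-- Allocate 4 GB to the InnoDB buffer pool, which caches both data and
-- index pages in memory:
SET GLOBAL innodb_buffer_pool_size = 4294967296;
```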
We may also benefit from preloading indexes into cache memory instead of relying on database activity to bring the most frequently used data or indexes into the cache. For instance, on MySQL, we can use the `LOAD INDEX INTO CACHE` statement.
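A minimal sketch, noting that this statement applies to MyISAM tables and reusing the table name from our running example:

```sql
-- Preload the indexes of the Bugs table into the MyISAM key cache:
LOAD INDEX INTO CACHE Bugs;
```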
## Rebuild
Indexes provide the greatest efficiency when they are balanced. Over time, as we update and delete rows, the indexes may become progressively imbalanced, similar to how file systems become fragmented over time. In practice, we may not see a large difference between an optimal index and one with some imbalance, but we want to get the most out of our indexes, so it’s worthwhile to perform maintenance on a regular basis.
Like most features related to indexes, each database brand uses vendor-specific terminology, syntax, and capabilities.
| Database Brand | Index Maintenance Command |
|---|---|
| IBM DB2 | `REBUILD INDEX` |
| Microsoft SQL Server | `ALTER INDEX ... REORGANIZE`, `ALTER INDEX ... REBUILD`, or `DBCC DBREINDEX` |
| MySQL | `ANALYZE TABLE` or `OPTIMIZE TABLE` |
| Oracle | `ALTER INDEX ... REBUILD` |
| PostgreSQL | `VACUUM` or `ANALYZE` |
| SQLite | `VACUUM` |
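For example, routine maintenance on MySQL and PostgreSQL might look like the following, reusing the table name from our running example:

```sql
-- MySQL: refresh index statistics, then defragment and rebuild:
ANALYZE TABLE Bugs;
OPTIMIZE TABLE Bugs;

-- PostgreSQL: reclaim dead space and update planner statistics:
VACUUM ANALYZE Bugs;
```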
How frequently should we rebuild an index? We might hear generic answers such as “once a week,” but in truth, there’s no single answer that fits all applications. It depends on how frequently we commit changes to a given table that could introduce imbalance. It also depends on how large the table is and how important it is to get optimal benefit from indexes for this table. Is it worth spending hours rebuilding indexes for a large but seldom used table if we can expect to gain only an extra 1 percent performance? At the end of the day, we ourselves are the best judges of this because we know our data and our operation requirements better than anyone else.
A lot of the knowledge about getting the most out of indexes is vendor-specific, so we’ll need to research the brand of database we use. Our resources include the database manual, books and magazines, blogs and mailing lists, and also lots of our own experimentation. The most important rule is that guessing blindly at indexing isn’t a good strategy.